Abstract: Phishing problems in cyberspace are increasing with the progress of information technology. Phishing is one of the most prominent cyber-attacks rooted in social engineering. This malicious activity aims to deceive individuals into divulging sensitive information, including credit card details, login credentials, and passwords. The main contribution of this research is finding the best outcome among various machine learning (ML) techniques. This paper uses an Extra Tree Classifier (ETC), Forward Selection, Pearson correlation, a Logit (LR) model and Principal Component Analysis (PCA) for feature selection. The Logistic Regression (LR), Naive Bayes (NB), Decision Tree (DT), K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Random Forest (RF), AdaBoost and Bagging classifiers are used for developing the phishing detection model. We have studied the model in four cases. Case 1 uses the 6 features commonly selected by ET, forward selection and Pearson correlation; case 2 uses the 25 features selected by the logit model; case 3 uses all features; and case 4 uses principal component analysis (3 and 5 components). We find the highest accuracy of 97.3% in case 2 with the random forest model.


Introduction

Phishing continues to grow as digital systems expand. Phishing crimes primarily use social engineering and technical deception to obtain customers' confidential data (Jamil et al., 2018). The customer is tricked into entering private data without verifying the site. So far, phishing attacks have appeared frequently on PCs and on other platforms. To reduce the risk of phishing attacks, some approaches have been proposed to train and educate end users to recognize and distinguish phishing URLs. However, these approaches still depend on users' habits and their knowledge of the underlying system. Software-based automatic approaches are generally used to detect phishing attacks because of their high accuracy and effectiveness (Zhu et al., 2019; Jain and Gupta, 2016; Sharfuddin et al., 2023). The benefit of the blacklist technique is that very few resources are required on the host system, since there is little need to analyse the site's content. Nonetheless, this technique has difficulty handling new phishing attacks because the repository that holds the blacklist is built from previously identified URLs. Heuristic phishing detection methods can be used to complement blacklists (Babagoli et al., 2019; Chaurasia and Pal, 2021). Then, based on the extracted features, the underlying ML classifiers are trained to recognize phishing sites. Classifiers are usually built from LR, SVM and NB models, among others. Using ML methods, phishing sites can be effectively identified (Chaurasia et al., 2022), and new phishing sites can also be handled. The key to this technique is to obtain highly discriminative features from phishing URLs and their associated sites. Still, a poor choice of sensitive features will make it impossible for the underlying classifiers to identify phishing sites reliably. At the same time, useless or ineffective features will cause the ML model to overfit (Cawley and Talbot, 2010; Dawn et al., 2023).

This paper proposes a hybrid model that combines feature selection and machine learning into an effective phishing attack detection model. The model selects features using ET, forward selection, Pearson correlation, and a statistical logit model to evaluate the impact of sensitive features on phishing site identification. Then, based on these results, the optimal features are chosen from the URLs. Finally, we trained the LR, NB, DT, KNN, SVM, RF, AdaBoost and Bagging algorithms on the important features to detect phishing attacks. In summary, the contributions of this paper are as follows:

(1) Model based on 3 feature selection methods. To better evaluate the impact of sensitive features on identifying phishing attacks, this paper proposes a model that combines 3 feature selection methods. The common features are characterized by combining the positive and negative features of the URL. By evaluating the 3 feature selection methods (ET, forward selection and Pearson correlation), some useless or insignificant features can be discarded to improve the performance of the entire model.

(2) Model based on the logit (Logistic Regression) feature selection algorithm. The algorithm first determines the significance of all the features of the input URL and its associated sites. Then a threshold (p-value < 0.05) is set to select sensitive features and build the optimal feature vector. With this algorithm, many useless and insignificant features are pruned away. Since these redundant features are no longer included, the overfitting problem of the base classifier is reduced. ML classifiers outperform many existing systems in phishing site detection.

(3) Analysis based on all features. Through the ML 
classifiers, the phishing dataset was evaluated with full 
features to compare with other models. The comparison 
with other models shows the importance of features in 
various models. 

(4) Model based on Principal Component Analysis (3 and 5 components). Through a selection of sensitive features and numerous experiments, the optimal ML classifier is trained and constructed as the model's final classifier.

In addition, as in a large number of works, SMOTE (Synthetic Minority Oversampling Technique) is used to distribute the data evenly between classes. Class imbalance is likely to be the main problem to solve to prevent misclassification due to heavily skewed classes. SMOTE is an oversampling strategy that increases the minority class in the dataset (Guan et al., 2021).
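
The oversampling idea can be illustrated with a minimal sketch. It is a simplified SMOTE-style interpolation written with the Python standard library only, not the implementation used in this paper; the function name `smote_like_oversample`, the toy points and the parameter values are assumptions made for illustration.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (made-up values, not taken from the phishing dataset).
minority = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote_like_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplication of records.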

The rest of this paper is organized as follows. Related work is presented in Section 2. Feature selection is presented in Section 3. The training approach is presented in Section 4, and Sections 5 and 6 present the discussion and conclusion.


Related Work 

Although there are approaches for training and educating end users, software-based automatic detection is the most notable technique for addressing the risk of phishing attacks. Mohammad et al. proposed an intelligent self-structuring neural network for identifying phishing websites (Mohammad et al., 2014). They used 17 features of 600 legitimate and 800 phishing websites collected from archives, extracted with the help of third-party services. Their experiments demonstrate the neural network's high generalization ability and robustness in phishing detection (Mohammad et al., 2013). 18 features are extracted for 859 legitimate and 969 phishing websites, respectively. Based on a whitelist, Kang and Lee (2007) proposed a system to detect phishing websites. This method determines whether the user is accessing a legitimate site by checking URL similarity. Jain and Gupta (2018) proposed an ML-based strategy to identify phishing sites using only client-side features, working towards detecting phishing websites on the client with a machine learning approach. They extracted some web-based features and evaluated their method on banking websites. Sharifi and Siadati (2008) proposed a blacklist generator strategy to identify phishing sites. This strategy determines whether a site is phishing by matching its domain name against Google's listing. While the above-mentioned studies recommend various features to identify phishing sites, some may not fully characterize phishing incidents (Rajab, 2018). Similarly, Babagoli et al. use a comparable dataset and propose feature selection using decision trees and the wrapper method, which results in selecting 20 features (Han et al., 2016; Babagoli et al., 2019). They assess phishing detection performance using a novel meta-heuristic-based nonlinear regression algorithm. Nevertheless, the feature selection methods proposed in these studies are data-dependent and require user-specified threshold values. Lee et al. (2014) proposed the PhishTrack framework for subsequently updating phishing blacklists. Blacklisting consumes very few resources on the host system. Nonetheless, it cannot handle new phishing attacks as expected (Aleroud and Zhou, 2017). Khonji et
al. (2013) explored several feature selection strategies for detecting email phishing, including Relief and correlation-based feature selection. The algorithms tested included RF, SVM and DT; random forest outperformed the others. Machine learning is widely studied for identifying phishing sites because of its dynamic learning capabilities. Zhang et al. (2007) proposed Cantina, a phishing attack detection model based on 27 sensitive features extracted from URLs. The model uses TF-IDF calculations to identify phishing attacks. The algorithm can identify many kinds of phishing attacks, but it incurs a large time cost and consumes resources of the underlying system. In addition, some legitimate websites were classified as phishing (Li et al., 2016). Basnet et al. (2012) tried two well-known feature selection strategies: wrapper and CFS. Greedy forward selection and genetic algorithms are used to evaluate features extracted from web pages and web crawlers. To evaluate the effectiveness of the feature selection technique, the authors used three ML algorithms, namely LR, RF, and NB. The results show that the wrapper feature selection strategy performed better than CFS in terms of accuracy. Compared to Cantina, CANTINA+ adds 10 additional features. Meanwhile, for phishing detection, SVM replaced the TF-IDF computation. With these upgrades, Cantina's deficiencies can be addressed. However, the new CANTINA+ has a limited scope of use (Nguyen et al., 2014). Chiew et al. (2019) proposed another feature selection technique called hybrid ensemble feature selection (HEFS). The authors divided their proposed method into two stages: they used the cumulative distribution function gradient (CDF-G) in the first stage to produce primary features, and these features were fed into the second stage, consisting of a data perturbation ensemble and a function perturbation ensemble, to produce the remaining features. A set of baseline features derived in these two stages is fed into the ML algorithms used to detect phishing. The selected features were fed into several ML algorithms; the best algorithm in terms of accuracy was RF, which obtained an accuracy of 94.6% when used with the baseline features. The authors used two tests to validate their proposed method. As an automatic phishing detection model, PhishStorm (Marchal et al., 2014) is implemented as an interface between social networking tools and email servers. It trains RF algorithms by extracting 12 significant URL features. Still, it fails to identify many phishing attacks because it contains few sensitive features (Leskovec et al., 2010).

Table 1. Comparison of some learning algorithms of previous research.

Reference | Dataset used (1 = legitimate, 0 = phishing) | Method proposed | Accuracy
Fette et al., 2007 | 6950 (1) & 860 (0) | LIBSVM | 99%
Abu-Nimeh et al., 2007 | 1700 (0) & 1700 (1) | LR, CART, BART, SVM, RF & NN | 95.11%
Chandrasekaran et al., 2006 | 100 (0) & 100 (1) | SVM | 95%
Jameel and George, 2013 | 3000 (0) & 3000 (1) | FFNN | 98.72%
Rathod and Pattewar, 2015 | 2500 (1) & 2100 (0) | NB | 96.46%
Rawal et al., 2017 | 1605 (0) & 414 (1) | RF & SVM | 99.87%
Shyni et al., 2016 | 5260 e-mails | SVM, RF & LB | 96.3%
Smadi et al., 2015 | 5000 (0) & 5000 (1) | J48 | 98.11%
Mbah, 2017 | 6951 (1) & 2357 (0) | KNN & J48 | 93.11%
Hota et al., 2018 | 1824 (0) & 1604 (1) | C4.5 & CART | 99.27%
Fang et al., 2019 | 7781 (1) & 999 (0) | RCNN | 99.848%
Aljofey et al., 2020 | 3000 (1) & 3000 (0) | RCNN | 95.02%
Sonowal, 2020 | 1824 (0) & 1604 (1) | BSFS | 97.41%
Bagui et al., 2021 | 3416 (0) & 14950 (1) | CNN | 95.97%

In this section, we have reviewed the performance of anti-phishing and machine-learning algorithms. The results of this analysis vary and show different accuracy values when websites are detected with different classifiers. Various algorithms have been used to enhance the accuracy, and the comparative study of the literature is presented in Table 1.


Material and Methods 

A URL is the address used to locate a website. A typical URL contains four parts: protocol, domain name, path, and query parameters (Leskovec et al., 2010). Internet resources can be accessed based on the address given by the URL. By extracting sensitive features from legitimate and phishing URLs and their associated sites, the underlying ML classifiers are trained to identify phishing attacks. As shown in Figure 1, an algorithm is first used to extract the features of the input URL and its associated site. The selection algorithm is then used to produce the optimal features. Techniques for selecting optimal features can reduce the burden of training the underlying ML classifiers. Using a set of processed URLs that contain the optimal features is the final step in the process of developing a new classifier.
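
As a small illustration of the four URL parts named above, the following sketch splits a URL with Python's standard `urllib.parse` module. The helper name `url_parts` and the example URL are hypothetical and are not part of the dataset used in this paper.

```python
from urllib.parse import urlsplit, parse_qs

def url_parts(url):
    """Return the protocol, domain name, path and query parameters of a URL."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,         # e.g. "https"
        "domain": parts.hostname,         # e.g. "login.example.com"
        "path": parts.path,               # e.g. "/account/verify.php"
        "query": parse_qs(parts.query),   # e.g. {"user": ["alice"]}
    }

# Hypothetical URL used only to show the decomposition.
example = url_parts("https://login.example.com/account/verify.php?user=alice&step=2")
```

Features such as the number of subdomains or the presence of HTTPS can then be derived from these parts before feature selection is applied.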




Forward Selection 

Forward selection is a wrapper method that iteratively evaluates the predictive capability of each feature and returns the set of features that performs best (Bokrantz et al., 2020).

Pearson Correlation

Pearson correlation is used to build a correlation matrix that measures the linear relationship between two features and gives a value between -1 and 1, showing how correlated the two features are with each other (Seo and Shneiderman, 2005).
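
A minimal sketch of how such a correlation matrix can be computed is given below, using only the Python standard library. The column values are made-up toy data encoded as -1/1; only the feature names HTTPS and AnchorURL come from the dataset, and the helper names are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict holding the pairwise correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy columns (illustrative values only).
columns = {
    "HTTPS":     [1, 1, -1, 1, -1, -1],
    "AnchorURL": [1, -1, -1, 1, -1, 1],
    "class":     [1, 1, -1, 1, -1, 1],
}
matrix = correlation_matrix(columns)
```

Features whose correlation with the class column is largest in absolute value are the natural candidates to keep.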




Figure 1. Workflow for generating optimal features from a phishing website dataset. 


Feature Selection 

Phishing attackers disguise illegitimate URLs as legitimate ones. By doing this, they can quickly trick customers into falling for phishing attacks (Gupta et al., 2018). Fortunately, unlike legitimate URLs, phishing URLs have clear, recognizable features. The following techniques are used for feature selection.

Extra Tree Classifier

The Extra Tree Classifier, or Extremely Randomized Trees Classifier, is an ensemble algorithm that builds randomized tree models from the training dataset and identifies the best-fit features (Tama and Lim, 2020). In Figure 2, out of the 30 independent features (the dataset has 30 independent features and 1 dependent variable), only the 10 most prominent ones are selected.


Figure 2. Features selected by the Extra Tree Classifier: HTTPS, AnchorURL, WebsiteTraffic, SubDomains, PrefixSuffix-, LinksinScriptTags, RequestURL, ServerFormHandler, LinksPointingToPage, DomainRegLen.

Statistical logit (LR) Model

Logit, or logistic regression, is a basic statistical method that models a binary dependent variable in its basic form (Mahapatra et al., 2022). In regression analysis, the logit model is fitted to estimate the parameters of the model. We only consider those features whose p-value is less than 0.05. Five features (Redirecting//, DomainRegLen, Favicon, AbnormalURL and UsingPopupWindow) have a p-value greater than 0.05, so we drop those features and keep the remaining 25 prominent features for further analysis.


Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that is mostly used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information in the large set (Chaurasia et al., 2021). Reducing the number of features of a dataset usually comes at some cost in accuracy. In return, smaller datasets are easier to explore and visualize, and they make data analysis by ML algorithms simpler and faster because there are no unnecessary variables to process.

The number of features selected by each of the above-mentioned feature selection techniques is presented in Table 2.

Table 2. Selected features and their numbers by different feature selection techniques

Algorithm | No. of selected features
Extra Tree Classifier | 10
Forward Selection | 6
Pearson Correlation | 10
Logit (LR) | 25

The features common to the ET classifier, Forward Selection and Pearson correlation suggest that PrefixSuffix-, SubDomains, HTTPS, AnchorURL, ServerFormHandler and WebsiteTraffic are the 6 important features. At the same time, the logit (LR) model suggests the 25 most significant features.

The following 4 cases were studied to discover the 
features most significant for predicting the phishing 
websites. 

Case-1: Analysis of common features shared by Extra 
Tree classifier, Forward Selection and Pearson 
correlation. 

Case-2: Analysis of features selected by the logit (LR) 
model. 

Case-3: Analysis of all features. 

Case-4: Analysis of features selected by PCA (3 and 5 components).

ML algorithms are designed so that they learn from experience, and their performance improves as they are fed more and more data. Each algorithm has its own way of learning from and predicting the data. In this section, we discuss the working of the following ML algorithms and some of the mathematical equations used in their learning process.

Logistic Regression 

LR is a classification technique that estimates the outcome of a categorical variable based on independent variables. It is mainly used for analysing data and fitting it to the logistic function. The predicted probability depends on the coefficients of the independent variables, which are the parameters to be estimated. Gradient descent is used to minimize the cost function.

Cost function of LR:

$$\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

LR equation:

$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$
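
The LR equation and its cost function can be written directly as a short sketch. The function names and the numeric inputs below are illustrative assumptions; the code only evaluates the two formulas and does not fit a model.

```python
from math import exp, log

def lr_probability(betas, x):
    """P = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn)); betas[0] is the intercept."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return 1.0 / (1.0 + exp(-z))

def lr_cost(h, y):
    """Per-sample cost: -log(h) if y == 1, otherwise -log(1 - h)."""
    return -log(h) if y == 1 else -log(1.0 - h)

# Illustrative coefficients and input (not estimated from the dataset).
p = lr_probability([0.5, 1.2, -0.7], [1.0, -1.0])
cost_if_phishing = lr_cost(p, 1)
```

The cost is small when the predicted probability agrees with the true label and grows without bound as the prediction approaches the wrong extreme, which is what gradient descent exploits when updating the coefficients.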


Naive Bayes

NB is a classification algorithm that relies on Bayes' theorem. The algorithm assumes that no relationship exists between the independent features. That is, the presence of a particular feature in a class is independent of the presence of any other feature in the same class. We build a frequency table for all features against the class and calculate the likelihood of all the predictors. With the NB equation, the posterior probabilities of all classes are computed. The class with the highest posterior probability is the result of the NB classifier.

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

$$P(c \mid x) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \dots \times P(x_n \mid c) \times P(c)$$

where c is the class and x is the predictor.

Decision Tree

The decision tree is mainly used for classification or regression decisions. The decision tree helps evaluate the dataset quality by calculating entropy and information gain. We have used these techniques to split the dataset for quality observation. The DT uses the Gini index as an important criterion.

$$\mathrm{Entropy} = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

$$\mathrm{Gini} = 1 - \sum_{i=1}^{c} (p_i)^2$$

where c is the number of classes.

K-Nearest Neighbor

K-NN is used for both regression and classification. The algorithm relies on estimating the Euclidean distance between data points. Other distances, such as the Manhattan distance, are also used.

Distance metrics:

Euclidean: $$\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

Manhattan: $$\sum_{i=1}^{k} |x_i - y_i|$$

Minkowski: $$\left(\sum_{i=1}^{k} \left(|x_i - y_i|\right)^q\right)^{1/q}$$
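
The three distance metrics can be sketched in a few lines. The Minkowski distance with exponent q is the general form, with q = 2 giving the Euclidean and q = 1 the Manhattan distance; the function names and the sample points are assumptions for illustration.

```python
def minkowski(x, y, q):
    """(sum_i |x_i - y_i|^q)^(1/q) for two equally long vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)  # special case q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)  # special case q = 1

# Illustrative points.
d_euclidean = euclidean([0.0, 0.0], [3.0, 4.0])
d_manhattan = manhattan([0.0, 0.0], [3.0, 4.0])
```

A K-NN classifier ranks the training samples by one of these distances and assigns the majority class of the k closest samples.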

SVM

SVM is a supervised learning method that, given a classification problem, finds an optimal hyperplane in N-dimensional space. The optimal hyperplane maximizes the margin distances between the observations of the classes using the hinge loss function. The dimension of the hyperplane depends on the number of input features. If N is the number of features, the hyperplane has N-1 dimensions.

$$\ell(y) = \max\left(0,\; 1 + \max_{y \neq t} w_y x - w_t x\right)$$

In this loss function, t represents the target variable, w represents the model parameters and x is the input variable.
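
The multiclass hinge loss above can be evaluated with the following minimal sketch. It assumes one weight vector per class and a plain dot product for each class score; the function name and the weights are illustrative and are not taken from a trained model.

```python
def multiclass_hinge(weights, x, t):
    """max(0, 1 + max_{y != t} w_y . x - w_t . x), with one weight vector per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    best_other = max(s for y, s in enumerate(scores) if y != t)
    return max(0.0, 1.0 + best_other - scores[t])

# Two classes with illustrative weight vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, 0.5]
loss_correct = multiclass_hinge(weights, x, t=0)  # target class scores highest
loss_wrong = multiclass_hinge(weights, x, t=1)    # target class scores lowest
```

The loss is zero only when the score of the target class exceeds every other class score by a margin of at least 1.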

Figure 3. Flow diagram of working method. The phishing website dataset is either passed through the feature selection algorithms (Extra Tree Classifier, Forward Selection, Pearson Correlation, Logit (LR), PCA with 3 and 5 components) or kept with all features, and is then fed to the ML classifiers (Logistic Regression, Naive Bayes, Decision Tree, K-Nearest Neighbor, SVM, Random Forest, AdaBoost, Bagging).


Random Forest 

Random forests combine many decision trees that work as an ensemble in a single model. In RF, each tree outputs a class prediction, and the votes are aggregated into the prediction of the forest. Techniques for ensuring that the trees differ from one another include bagging and random feature selection. Bagging is the process of drawing random samples of observations from a dataset.

AdaBoost

AdaBoost, short for Adaptive Boosting, is used as an ensemble technique in ML. It is called adaptive boosting because the weights are reassigned to each instance, with higher weights assigned to incorrectly classified instances. Boosting is used to reduce bias and variance in supervised learning. It works on the principle that learners are grown sequentially. Except for the first, each subsequent learner is grown from previously grown learners. In simple words, weak learners are converted into strong ones. AdaBoost works on the same principle as boosting, with a slight difference.

Bagging

Bagging produces additional data for training from the dataset. This is achieved by random sampling with replacement from the original dataset. Sampling with replacement may repeat some observations in each new training dataset. Every element in Bagging is equally likely to appear in a new dataset.

These multiple datasets are used to train multiple models in parallel. The average of all the predictions from the different ensemble models is calculated. For classification, the majority vote obtained through the voting mechanism is used. Bagging decreases the variance and tunes the prediction towards the expected outcome.
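
The two ingredients of bagging described above, sampling with replacement and majority voting, can be sketched as follows. The helper names, the toy data and the seed are assumptions; no classifier is trained here.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so items may repeat."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the class predicted by most of the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))                                   # toy record ids
samples = [bootstrap_sample(data, rng) for _ in range(3)]
final = majority_vote(["phishing", "legitimate", "phishing"])
```

In a full bagging ensemble, one model would be trained on each bootstrap sample and the per-model predictions for a URL would be passed to `majority_vote`.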

Flow model of work

Fig. 3 shows the workflow of the model for detecting phishing attacks. As shown, the phishing website dataset is processed by two methods. In the first method, several important algorithms are used to select only the prominent and sensitive features. In the second method, the dataset is carried forward with its full set of features. At the next level, several machine learning classifiers were applied. Following that, a comparison was made between the accuracy results provided by the machine learning classifiers in each method. We see that the feature selection method is more appropriate.

Description of Datasets

The dataset is the phishing websites dataset from the UCI repository (UCI Repository). The data consist of 11,054 samples, with 55.69% legitimate URLs and 44.31% phishing URLs. 80% of the 11,054 samples are used to train the classifier, and 20% of the samples are used to evaluate the performance of the model. A small number of training samples can cause the final classifier to suffer from underfitting and weak generalization. Conversely, when all samples of the dataset are used for training, the classifier will fall into the problem of overfitting and poor classification results.
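
The 80/20 split can be sketched with the standard library as below. The shuffling, the seed and the function name are assumptions made for illustration; the paper does not state how the split was drawn.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.8, seed=42):
    """Shuffle the sample indices and cut them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

# 11,054 samples, as reported for the dataset.
train_idx, test_idx = train_test_split_indices(11054)
```

With 11,054 samples this yields 8,843 training samples and 2,211 test samples, with no sample appearing in both parts.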

Results and Discussion 

In case 1, the Extra Tree classifier, Forward Selection and Pearson correlation models selected the features PrefixSuffix-, SubDomains, HTTPS, AnchorURL, ServerFormHandler and WebsiteTraffic. Several ML classifiers (LR, NB, DT, KNN, SVM, RF, AdaBoost and Bagging) were applied for accuracy assessment. As shown in Table 3, the highest accuracy (93.86%) is tied between two classifiers, Random Forest and Bagging.

In case 2, 25 features (Table 2) were selected by the logit (LR) model. The same classifiers were applied in this case. In terms of accuracy, Table 3 shows that Random Forest again achieved the highest accuracy (97.30%) among all the classifiers.

In case 3, all dataset features were used to measure the accuracy of the classifiers. Again, Random Forest achieves the highest accuracy, 97.10% (Table 3).

Table 3. Obtained accuracy in different cases.

Classifier | Case-1 (6 features) | Case-2 (Logit, 25 features) | Case-3 (all features) | Case-4 (PCA, 3 components) | Case-4 (PCA, 5 components)
Logistic Regression | 0.9137 | 0.9271 | 0.9267 | 0.8141 | 0.9242
Naive Bayes | 0.9082 | 0.9144 | 0.9097 | 0.8211 | 0.9025
Decision Tree | 0.9362 | 0.9425 | 0.9293 | 0.8580 | 0.9384
K-Nearest Neighbor | 0.9176 | 0.9650 | 0.9598 | 0.8978 | 0.9510
SVM | 0.9361 | 0.9522 | 0.9483 | 0.8317 | 0.9278
Random Forest | 0.9386 | 0.9730 | 0.9710 | 0.9544 | 0.9748
AdaBoost | 0.9310 | 0.9390 | 0.9131 | 0.8280 | 0.9230
Bagging | 0.9386 | 0.9697 | 0.9457 | 0.9442 | 0.9703


In case 4, the experiment is conducted with principal component analysis (3 and 5 components). Random Forest again achieved the highest accuracy (95.44% with 3 components and 97.48% with 5 components). Figure 4 is drawn for better comprehension of the results in Table 3.

Since, across all cases, the case-2 logit (LR) model with 25 features performed well, we have drawn ROC curves with AUC values for the different classifiers. The RF, LR, NB, DT, SVC, AdaBoost and Bagging classifiers reach an AUC of approximately 100% (Figure 5).
Figure 5. ROC (AUC) curve by logit (LR) model with 25 features. Axes: False Positive Rate (x) and True Positive Rate (y). Legend: LogisticRegression, AUC=1.000; GaussianNB, AUC=1.000; DecisionTreeClassifier, AUC=1.000; KNeighborsClassifier, AUC=0.997; SVC, AUC=1.000; RandomForestClassifier, AUC=1.000; AdaBoostClassifier, AUC=1.000; BaggingClassifier, AUC=1.000.


Figure 4. Accuracy of classifiers in different cases. Series: Case-1 (6 features), Case-2 (Logit, 25 features), Case-3 (all features), Case-4 (PCA, 3 components), Case-4 (PCA, 5 components).


Conclusion and Future Work

In this research, four feature selection algorithms and principal component analysis are first applied to analyse the impact of sensitive features on phishing detection. Then, given the results of these feature selection processes, the optimal feature selection algorithm aims to find the optimal feature vector for machine learning techniques. The approach can handle a large number of phishing-sensitive features and changing features. Consequently, it can alleviate the over-fitting problem of ML classifiers. Finally, through experimental investigation, the optimal machine learning classifier is trained to identify phishing attacks. Two important outcomes resulted from this experiment. First, we can conclude that among all the feature selection models, the Logit (LR) model with 25 features in case 2 performed well (Table 3). Second, RF achieved the highest score among all the classifiers in all cases (Table 3). So, the Logit (LR) model for feature selection and Random Forest for classification could be more appropriate for detecting phishing websites. As the sensitive features of phishing attacks continue to change, collecting more features in the future for optimal feature selection is important.